1. Purpose

The purpose of this exercise is to find groups of items that are frequently purchased together in the given dataset, which comes from a large online retailer.

2. Tools Used

The entire task is coded in R, and this document was created with R Markdown. All relevant code snippets are shown within the document itself; the raw code can be found in the file “Market_Basket_Analysis.Rmd”.

To find the market baskets I will mine association rules with the Apriori algorithm, which identifies the products that are most often bought together.

3. Data Preparation

To begin the analysis, let’s load libraries and import our data:

library(arules)
library(plyr)
library(arulesViz)
library(knitr)

# read in the data
data <- read.csv("http://eesposito.com/project_files/market_basket_analysis/data.csv", header = TRUE)

# convert the item columns (every column except the first) from numeric to logical;
# the first column is dropped in the process
nms <- names(data)
nms <- nms[2:length(nms)]
data <- as.data.frame(lapply(X = data[nms], FUN = as.logical))
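
As a quick sanity check (not part of the original pipeline), the converted data frame can be coerced to an arules transactions object and summarized; a minimal sketch, with an illustrative object name:

# optional sanity check: coerce the logical data frame to a "transactions"
# object and look at how often individual items occur
trans <- as(data, "transactions")
summary(trans)                       # transaction/item counts, density, most frequent items
itemFrequencyPlot(trans, topN = 10)  # bar chart of the 10 most frequent items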

4. Modeling Process

Now let’s use the Apriori algorithm to find which items are bought together:

# run apriori
rules <- apriori(data[, -1],
                 parameter = list(supp = 0.005, conf = 0.1, target = "rules",
                                  minlen = 2, maxlen = 10),
                 control = NULL)
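
Though not shown in the original write-up, the number of rules returned (93 in this run) and an overview of their quality measures can be checked directly:

# how many rules did apriori return, and what do their quality measures look like?
length(rules)
summary(rules)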

We now have 93 rules, but many of them are redundant. Several rules are simply subsets or inverses of one another, as the example below shows:

# get the rules as a data frame for inspection
rules_in <- inspect(rules)

# show all rules involving items 2, 7, and 29
rules_example <- rules_in[which(rules_in$rhs == "{item_2}"  | rules_in$lhs == "{item_2}"  |
                                rules_in$lhs == "{item_29}" | rules_in$rhs == "{item_29}" |
                                rules_in$lhs == "{item_7}"  | rules_in$rhs == "{item_7}"), ]
# show rules
kable(rules_example,format="markdown")

|    | lhs              |    | rhs       | support | confidence |     lift |
|:---|:-----------------|:---|:----------|--------:|-----------:|---------:|
| 7  | {item_29}        | => | {item_7}  | 0.24306 |  0.8588086 | 3.034230 |
| 8  | {item_7}         | => | {item_29} | 0.24306 |  0.8587479 | 3.034230 |
| 9  | {item_29}        | => | {item_2}  | 0.24266 |  0.8573952 | 3.026884 |
| 10 | {item_2}         | => | {item_29} | 0.24266 |  0.8566688 | 3.026884 |
| 11 | {item_7}         | => | {item_2}  | 0.24328 |  0.8595252 | 3.034404 |
| 12 | {item_2}         | => | {item_7}  | 0.24328 |  0.8588576 | 3.034404 |
| 36 | {item_7,item_29} | => | {item_2}  | 0.21767 |  0.8955402 | 3.161548 |
| 37 | {item_2,item_29} | => | {item_7}  | 0.21767 |  0.8970164 | 3.169221 |
| 38 | {item_2,item_7}  | => | {item_29} | 0.21767 |  0.8947304 | 3.161368 |
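
As an aside, a similar slice of rules can be pulled with arules’ own subset() helper instead of filtering the inspected data frame; a sketch (the filter below is slightly broader, matching any rule that mentions these items anywhere):

# select every rule whose lhs or rhs mentions item_2, item_7, or item_29
# (a slightly broader filter than the exact-match comparison above)
rules_example2 <- subset(rules, items %in% c("item_2", "item_7", "item_29"))
inspect(rules_example2)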

The nine rules above all describe a single basket: {item_2, item_7, item_29}. For simplicity, we want to “prune” such rules and keep only the most useful one, namely the “superset” rule with the highest measure of interest (I chose support here, although the candidate rules were not materially different).

Let’s prune:

# add rule size (number of items in lhs + rhs) as a quality measure for pruning
quality(rules) <- cbind(quality(rules), size = size(rules))

# sort by support, then by size, so the largest ("superset") rules rank first
rules <- sort(sort(rules, by = "support", decreasing = TRUE), by = "size", decreasing = TRUE)

# flag rules whose lhs is not contained in any other rule's lhs
superset.matrix <- is.subset(lhs(rules), y = NULL, sparse = FALSE)
superset <- rowSums(superset.matrix, na.rm = TRUE) == 1

# keep only these "superset" rules
rules <- rules[superset]

# sort rules by support
rules <- sort(rules, by = "support")

# flag rules whose full item set contains (or equals) that of an earlier,
# higher-support rule, i.e. redundant body/head permutations
subset.matrix <- is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1

# prune the redundant body/head rules and re-inspect
rules <- rules[!redundant]
rules_in <- inspect(rules)
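
arules also ships a built-in redundancy test based on confidence improvement; it applies a different criterion than the subset pruning above, but can serve as a quick cross-check (a sketch):

# cross-check: count rules flagged by arules' confidence-based redundancy test
# (note: a different notion of redundancy than the subset/superset pruning above)
sum(is.redundant(rules))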

5. Technical Discussion of Results

We are now left with only 3 rules. Let’s look at our new pruned set of rules:

# show rules
kable(rules_in,format="markdown")

|    | lhs                             |    | rhs       | support | confidence |     lift | size |
|:---|:--------------------------------|:---|:----------|--------:|-----------:|---------:|-----:|
| 36 | {item_7,item_29}                | => | {item_2}  | 0.21767 |  0.8955402 | 3.161548 |    3 |
| 33 | {item_3,item_22}                | => | {item_5}  | 0.21670 |  0.8972713 | 3.176856 |    3 |
| 89 | {item_1,item_9,item_35,item_42} | => | {item_39} | 0.14692 |  0.9017369 | 3.784041 |    5 |
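
Each of these rules corresponds to one basket; if desired, the underlying item sets can be listed directly (a sketch using arules accessors):

# list the full item set (lhs plus rhs) behind each pruned rule
LIST(items(rules))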

All measures of interest (support, confidence, and lift) look appropriate. Let’s discuss what each of these column headings actually means:

  1. LHS - The Left Hand Side (also called the “Body”) represents the known input of the rule, i.e. the items already in the basket.

  2. RHS - The Right Hand Side (also called the “Head”) represents the outcome implied by the LHS. In a market basket analysis such as this one, whether an item sits on the LHS or the RHS is inconsequential, since we are not investigating any specific directional relationship.

  3. Support - In this context, support represents the percentage of transactions in which this market basket was observed, with respect to the entire 100,000-row dataset.

  4. Confidence - Confidence is the percentage of occurrences in which the rule held true (given the LHS, the actual RHS value matched the rule’s RHS). All confidence scores were very high, which is why I felt comfortable pruning the original 93 rules down to the 3 superset rules.

  5. Lift - Lift measures how much better the RHS can be predicted given the LHS, compared to random chance [P(Head|Body)/P(Head)]. All rules had lift significantly greater than 1, indicating that these rules are very useful and much better than chance. (All three measures can also be recomputed directly from the data, as sketched after this list.)
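
As a cross-check of these definitions, all three measures can be recomputed directly from the transaction data with arules’ interestMeasure(); a sketch, assuming the logical data frame data that was mined is still in memory:

# recompute support, confidence and lift for the pruned rules straight from the
# data that was mined (the same columns that were passed to apriori above)
interestMeasure(rules,
                measure = c("support", "confidence", "lift"),
                transactions = as(data[, -1], "transactions"))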

6. Graphical Summary & Conclusion

Finally, let’s visualize the baskets of items often purchased together:

# plot(rules, method = "grouped")
plot(rules, method = "graph", control = list(type = "items", cex = 1.20),
     measure = "support", shading = NA,
     main = paste("Graph of", length(rules), "Baskets (aka rules)"))

The node graph above best summarizes the results of this exercise by showing the 3 distinct baskets I discovered within the data. The baskets of items are: